Abstract

We describe a RAM algorithm computing all runs (maximal repetitions) of a given string of length n over
a general ordered alphabet in O(n log^{2/3} n) time and linear space. Our algorithm outperforms all known
solutions working in Θ(n log σ) time provided σ = n^{Ω(1)}, where σ is the alphabet size. We conjecture that
there exists a linear time RAM algorithm finding all runs.
1. Introduction

Repetitions in strings are fundamental objects
in both stringology and combinatorics on words.
In some sense the notion of run, introduced by
Main [?], captures the whole repetitive
structure of a given string in a relatively simple
form. Informally, a run of a string is a maximal
periodic substring that is at least twice as long as
its minimal period (the precise definition follows).
In [3] Kolpakov and Kucherov showed that any
string of length n contains O(n) runs and proposed
an algorithm computing all runs in linear time on
an integer alphabet {0, 1, ..., n^{O(1)}} and in O(n log σ)
time on a general ordered alphabet, where σ is the
number of distinct letters in the input string.
Recently, Bannai et al. described another interesting
algorithm computing all runs in O(n log σ) time [1].
Modifying the approach of [1], we prove the
following theorem.

Theorem. For a general ordered alphabet, there is
an algorithm that computes all runs in a string of
length n in O(n log^{2/3} n) time and linear space.

This is in contrast to the result of Main and
Lorentz [14], who proved that any algorithm
deciding whether a string over a general unordered
alphabet has at least one run requires Ω(n log n)
comparisons in the worst case.

Our algorithm outperforms all known solutions
when the number of distinct letters in the input
string is sufficiently large (e.g., σ = n^{Ω(1)}). It
should be noted that the algorithm of Kolpakov and
Kucherov can hardly be improved in a similar way
since it strongly relies on a structure (namely, the
Lempel-Ziv decomposition) that cannot be computed
in o(n log σ) time on a general ordered alphabet
(see [?]).

Based on some theoretical observations of [?], we
conjecture that one can further improve our result.

Conjecture. For a general ordered alphabet, there 
is a linear time algorithm computing all runs. 

2. Preliminaries 

A string w of length n over an alphabet Σ is a map
{1, 2, ..., n} → Σ, where n is referred to as the
length of w, denoted by |w|. We write w[i] for the
ith letter of w and w[i..j] for w[i]w[i+1]...w[j].
A string u is a substring (or a factor) of w if
u = w[i..j] for some i and j. The pair (i, j) is
not necessarily unique; we say that i specifies an
occurrence of u in w. A string can have many
occurrences in another string. A substring w[1..j]
(respectively, w[i..n]) is a prefix (respectively, suffix)
of w. An integer p is a period of w if 0 < p < |w| and
w[i] = w[i+p] for all i = 1, ..., |w|−p; p is the minimal
period of w if p is the minimal positive integer
that is a period of w. For integers i and j, the set
{k ∈ Z : i ≤ k ≤ j} (possibly empty) is denoted by
[i..j]. Denote [i..j) = [i..j−1] and (i..j] = [i+1..j].

A run of a string w is a substring w[i..j] whose
period is at most half of the length of w[i..j] and
such that both substrings w[i−1..j] and w[i..j+1], if
defined, have strictly greater minimal periods than
w[i..j].
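
The definition can be checked directly on small inputs. The following is a minimal brute-force sketch, not the algorithm of this paper: it tests every substring against the definition and takes cubic time or worse. The names minimal_period and runs_bruteforce are illustrative, and one convention is an assumption of the sketch: a string with no period smaller than its length is treated as having minimal period equal to its length.

```python
def minimal_period(s: str) -> int:
    """Smallest p >= 1 such that s[i] == s[i + p] for every valid i.

    Assumption of this sketch: if no p < len(s) works, len(s) is returned.
    """
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return n  # reached only for the empty string


def runs_bruteforce(w: str) -> list[tuple[int, int, int]]:
    """All runs of w as triples (i, j, p): 1-based inclusive bounds and minimal period."""
    n = len(w)
    found = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            sub = w[i - 1:j]
            p = minimal_period(sub)
            if 2 * p > len(sub):
                continue  # shorter than twice its minimal period
            # Maximality: extending by one letter must increase the minimal period.
            left_ok = i == 1 or minimal_period(w[i - 2:j]) > p
            right_ok = j == n or minimal_period(w[i - 1:j + 1]) > p
            if left_ok and right_ok:
                found.append((i, j, p))
    return found
```

For instance, on "mississippi" the sketch reports the run w[2..8] = ississi with minimal period 3 together with three short runs of period 1.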

We say that an alphabet is general and ordered
if it is totally ordered and the only allowed operation
is comparing two letters. Hereafter, w denotes
the input string of length n over a general ordered
alphabet.

In the longest common extension (LCE) problem
one has to preprocess w for queries LCE(i, j)
returning for given positions i and j of w the length
of the longest common prefix of the suffixes w[i..n]
and w[j..n]. It is well known that one can perform
the LCE queries in constant time after preprocessing
w in O(n log σ) time, where σ is the number of
distinct letters in w (e.g., see [?]). It turns out that
the time consumed by the LCE queries is dominating
in the algorithm of [1]; namely, one can prove
the following lemma.

Lemma 1 (see [1, Alg. 1 and Sect. 4.2]). Suppose
we can answer in an online fashion any sequence of
O(n) LCE queries on w in O(f(n)) time for some
function f(n); then we can find all runs of w in
O(n + f(n)) time.
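
As a point of reference for what an LCE query returns, here is a minimal sketch of the naive oracle that compares letters one by one. It uses only letter comparisons, so it is valid on a general ordered alphabet, but a single query may take time proportional to n, which is exactly the cost the rest of the paper avoids. The function name lce_naive and the 1-based position convention are assumptions of this sketch.

```python
def lce_naive(w: str, i: int, j: int) -> int:
    """Length of the longest common prefix of w[i..n] and w[j..n] (1-based i, j)."""
    n = len(w)
    d = 0
    # Extend while both positions stay inside w and the letters agree.
    while i + d <= n and j + d <= n and w[i + d - 1] == w[j + d - 1]:
        d += 1
    return d
```

On w = abcabcababcabb$ this gives LCE(1, 4) = 5, since both suffixes start with abcab and then differ.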

In what follows we describe an algorithm that
computes O(n) LCE queries in O(n log^{2/3} n) time
and thus prove Theorem using Lemma 1. The key
notion in our construction is a difference cover. Let
k ∈ N. A set D ⊂ [0..k) is called a difference cover
of [0..k) if for any x ∈ [0..k), there exist y, z ∈
D such that y − z ≡ x (mod k). Clearly |D| ≥
√k. Conversely, for any k ∈ N, there is a difference
cover of [0..k) with O(√k) elements: for example,
the difference cover [0..⌊√k⌋] ∪ {2⌊√k⌋, 3⌊√k⌋, ...},
which is depicted in Fig. 1. For further discussions
and estimations of minimal difference covers, see
[?].

Figure 1: Simple difference cover of [0..k) with k = 18.

Example. The set D = {1,2,4} is a difference 
cover of [0..5). 
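
The simple cover described above and the defining property are easy to check mechanically. The sketch below builds [0..⌊√k⌋] ∪ {2⌊√k⌋, 3⌊√k⌋, ...} restricted to [0..k) and verifies that every residue modulo k is a difference of two of its elements. The function names are illustrative and not taken from the paper.

```python
from math import isqrt


def simple_difference_cover(k: int) -> list[int]:
    """The cover [0..s] plus the multiples 2s, 3s, ... below k, where s = floor(sqrt(k))."""
    s = isqrt(k)
    cover = set(range(0, s + 1)) | set(range(2 * s, k, s))
    return sorted(x for x in cover if x < k)


def is_difference_cover(D, k: int) -> bool:
    """True if every x in [0..k) equals (y - z) mod k for some y, z in D."""
    diffs = {(y - z) % k for y in D for z in D}
    return diffs == set(range(k))
```

For k = 18 the construction yields {0, 1, 2, 3, 4, 8, 12, 16}, which has 8 elements.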
Our algorithm utilizes the following interesting
property of difference covers.

Lemma 2 (see [?]). Let D be a difference cover of
[0..k). For any integers i, j, there exists d ∈ [0..k)
such that (i + d) mod k ∈ D and (j + d) mod k ∈ D.
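
Lemma 2 can be illustrated by searching for the shift d directly. This is a minimal sketch that tries each d in [0..k) in turn; the name common_shift is hypothetical, and the function returns None if no shift exists, which by the lemma cannot happen when D is a difference cover.

```python
def common_shift(D, k: int, i: int, j: int):
    """Smallest d in [0..k) with (i + d) mod k and (j + d) mod k both in D, or None."""
    members = set(D)
    for d in range(k):
        if (i + d) % k in members and (j + d) % k in members:
            return d
    return None
```

With D = {0, 1, 3} and k = 4, the positions i = 2 and j = 5 are aligned by d = 2, since 4 mod 4 = 0 and 7 mod 4 = 3 both lie in D.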

3. Longest Common Extensions 

At the beginning, our algorithm fixes an integer
τ (the precise value of τ is given below). Let D
be a difference cover of [0..τ^2) of size O(τ).
Denote M = {i ∈ [1..n] : (i mod τ^2) ∈ D}. Obviously,
we have |M| = O(n/τ). Our algorithm builds
in O((n/τ)(τ^2 + log n)) = O((n/τ) log n + nτ) time a data
structure that can calculate LCE(x, y) in constant
time for any x, y ∈ M. To compute LCE(x, y) for
arbitrary x, y ∈ [1..n], we simply compare w[x..n]
and w[y..n] from left to right until we either find a
mismatch or reach positions x + d and y + d such that
x + d ∈ M and y + d ∈ M; in the latter case we obtain
LCE(x, y) = d + LCE(x+d, y+d) in constant time.
By Lemma 2, we have d < τ^2 and therefore the value
LCE(x, y) can be computed in O(τ^2) time. Thus, our
algorithm can execute any sequence of O(n) LCE queries in
O((n/τ) log n + nτ^2) time. Putting τ = ⌈log^{1/3} n⌉, we
obtain O((n/τ) log n + nτ^2) = O(n log^{2/3} n). Now it
suffices to describe the data structure answering the
LCE queries on the positions from M.
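
The reduction from arbitrary positions to positions in M can be sketched as follows. The sketch assumes that some oracle lce_sampled(x, y) already answers queries for x, y ∈ M; here a naive comparison loop stands in for the constant-time structure described in the rest of this section. The names make_lce, lce_sampled and lce_naive are illustrative only.

```python
def lce_naive(w: str, i: int, j: int) -> int:
    """Reference LCE by direct comparison (1-based positions)."""
    n, d = len(w), 0
    while i + d <= n and j + d <= n and w[i + d - 1] == w[j + d - 1]:
        d += 1
    return d


def make_lce(w: str, tau: int, D, lce_sampled):
    """Return lce(x, y) for arbitrary positions, given an oracle for positions in M."""
    n, k = len(w), tau * tau
    members = set(D)

    def in_M(p: int) -> bool:
        return 1 <= p <= n and p % k in members

    def lce(x: int, y: int) -> int:
        d = 0
        while x + d <= n and y + d <= n:
            if in_M(x + d) and in_M(y + d):
                # Both shifted positions are sampled: finish with one oracle call.
                return d + lce_sampled(x + d, y + d)
            if w[x + d - 1] != w[y + d - 1]:
                return d  # mismatch found before reaching sampled positions
            d += 1
        return d  # one of the suffixes is exhausted

    return lce
```

By Lemma 2 the loop performs fewer than τ^2 letter comparisons before the oracle is called, which is where the O(τ^2) bound per query comes from.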

Let i_1, i_2, ..., i_m be the sequence of all
positions from M in the increasing lexicographical
order of the corresponding suffixes
w[i_1..n], w[i_2..n], ..., w[i_m..n]. Our algorithm
builds a longest common prefix array lcp[1..m−1]
such that lcp[j] = LCE(i_j, i_{j+1}) for j ∈ [1..m) and
a sparse suffix array sa[1..n] such that i_{sa[x]} = x
for x ∈ M and sa[x] = 0 for x ∉ M. Obviously
LCE(i_j, i_k) = min{lcp[j], lcp[j+1], ..., lcp[k−1]}
for j < k. Based on this observation, we
equip the lcp array with the range minimum
query (RMQ) structure [5] that allows one to compute
min{lcp[j], lcp[j+1], ..., lcp[k−1]} for any j < k in
O(1) time. Now, to answer LCE(x, y) for x, y ∈ M,
we first obtain j = sa[x] and k = sa[y] and then
answer LCE(i_j, i_k) using the RMQ structure on the
lcp array. Since the RMQ structure can be built in
O(n) time [5], it remains to describe how to
construct lcp and sa.
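
The way sa, lcp and RMQ combine into an LCE oracle on M can be sketched as follows. This is not the construction of the paper: the suffixes are sorted naively, lcp is filled by direct comparison, and the RMQ is a plain sparse table with O(m log m) preprocessing rather than the linear-time structure of [5]. The sketch also assumes that w ends with a unique smallest letter, so that built-in string comparison orders the suffixes correctly. All names are illustrative.

```python
def build_sampled_lce(w: str, M):
    """Return (sa, lcp, lce) where lce(x, y) answers queries for x, y in M."""
    n = len(w)
    order = sorted(M, key=lambda p: w[p - 1:])  # i_1, ..., i_m (naive suffix sort)
    sa = [0] * (n + 1)                          # sa[x] = rank of x (1-based), 0 if x not in M
    for rank, p in enumerate(order, start=1):
        sa[p] = rank

    def naive(i: int, j: int) -> int:
        d = 0
        while i + d <= n and j + d <= n and w[i + d - 1] == w[j + d - 1]:
            d += 1
        return d

    # 0-based storage: lcp[r] holds LCE(i_{r+1}, i_{r+2}).
    lcp = [naive(order[r], order[r + 1]) for r in range(len(order) - 1)]

    # Sparse table: table[l][t] = min(lcp[t : t + 2**l]).
    table, span = [lcp], 1
    while 2 * span <= len(lcp):
        prev = table[-1]
        table.append([min(prev[t], prev[t + span]) for t in range(len(prev) - span)])
        span *= 2

    def lce(x: int, y: int) -> int:
        if x == y:
            return n - x + 1
        j, k = sorted((sa[x], sa[y]))
        lo, hi = j - 1, k - 1                   # half-open range [lo, hi) in 0-based lcp
        level = (hi - lo).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level)])

    return sa, lcp, lce
```

Each query reads two table entries, which mirrors the constant-time bound claimed for positions in M.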

In general our construction is similar to that
of [?]. We use the fact that the set M has "period"
τ^2, i.e., for any x ∈ M, we have x+τ^2 ∈ M provided
x + τ^2 ≤ n. For simplicity, assume that w[n] is a
special letter that is smaller than any other letter
in w. Our algorithm iteratively inserts the suffixes
{w[x..n] : x ∈ M} in the arrays lcp and sa from
right to left. Suppose, for some k ∈ M, we have
already inserted in lcp and sa the suffixes w[x..n]
for all x ∈ M ∩ (k..n]. More precisely, denote by
i'_1, ..., i'_{m'} the sequence of all positions M ∩ (k..n]
in the increasing lexicographical order of the
corresponding suffixes w[i'_1..n], w[i'_2..n], ..., w[i'_{m'}..n]; we
suppose that lcp[j] = LCE(i'_j, i'_{j+1}) for j ∈ [1..m'),
i'_{sa[x]} = x for x ∈ M ∩ (k..n], and sa[x] = 0 for
x ∉ M ∩ (k..n]. We are to insert the suffix w[k..n]
in lcp and sa. In order to perform the insertions
efficiently, during the construction, the arrays lcp and
sa are represented by balanced search trees with
some auxiliary structures as described below.

1. Balanced search tree for lcp. The lcp array is
represented by an augmented balanced search tree
so that any RMQ query and modification on lcp
take O(log n) amortized time.

2. List L. We store all positions M ∩ (k..n] on
a linked list L in the lexicographical order of the
corresponding suffixes. We maintain on this list the
order maintenance data structure of [2] that allows one
to determine whether a given node of L precedes
another node of L in constant time. The insertion
of a new node in L takes amortized constant time.
To provide constant time access to the nodes of L,
we maintain an array nds[1..n] such that nds[x] is
the node of L corresponding to position x if x ∈
M ∩ (k..n], and nds[x] = nil otherwise.

3. Balanced search tree for sa. It is straightforward
that, for any x ∈ M ∩ (k..n], sa[x] is equal to one plus the
number of nodes of L preceding nds[x]. So, we store
all nodes of L in an augmented balanced search tree
that makes it possible to calculate the number of nodes preceding
nds[x] in O(log n) time (since the comparison of two
nodes takes O(1) time). This tree together with the
list L and the array nds allows one to compute sa[x] in
O(log n) time.

4. Trie S. We maintain a compacted trie S that
contains the strings w[x..x+τ^2] for all x ∈ M ∩ (k..n]
(we assume w[j] = w[n] for all j > n and thus
w[x..x+τ^2] is always well defined). We maintain
on S the data structure of [5] supporting insertions
in O(τ^2 + log n) amortized time. Let a be the leaf
of S corresponding to a string w[x..x+τ^2]. We
augment a with a balanced search tree B_a that contains
nodes nds[y] for all positions y ∈ M ∩ (k..n] such
that w[y−τ^2..y] = w[x..x+τ^2] (see Figure 2). We
use B_a to compute in O(log n) time the immediate
predecessor and successor of any given node nds[z],
where z ∈ M ∩ (k..n], in the set of nodes stored in
B_a. It is easy to see that S together with the
associated search trees occupies O(n/τ) space in total.

Example. Let τ^2 = 4. The set D = {0,1,3} is
a difference cover of [0..τ^2). Consider the string
w = abcabcababcabb$; the positions from
M = {i ∈ [1..n] : (i mod τ^2) ∈ D} are
1, 3, 4, 5, 7, 8, 9, 11, 12, 13, 15. Figure 2
depicts the compacted trie S; each leaf of S is
augmented with a balanced search tree of certain
positions from M ∩ (k..n] (we use positions rather than
nodes in this example). Consider the leaf of S
corresponding to the string abcab. The string abcab
occurs at positions 4, 9, 1 in w. Hence, the
balanced search tree B_4 must contain three positions:
4+τ^2 = 8, 9+τ^2 = 13, 1+τ^2 = 5. Note that the
positions are stored in the lexicographical order of
the corresponding suffixes w[8..n], w[13..n], w[5..n].
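
The example can be reproduced with a few lines of code. The sketch below groups the positions of M by the string of length τ^2 + 1 that ends at them and orders each group by the corresponding suffixes; plain dictionaries and sorted lists stand in for the compacted trie and the balanced search trees, and the end marker is written as $ (assumed here to be the letter smaller than all others). The names block and groups are illustrative.

```python
from collections import defaultdict

w = "abcabcababcabb$"
n = len(w)
period = 4                      # tau^2 in the example
D = {0, 1, 3}
M = [i for i in range(1, n + 1) if i % period in D]


def block(x: int) -> str:
    """w[x..x+tau^2], with w[j] = w[n] assumed for j > n."""
    return "".join(w[min(p, n) - 1] for p in range(x, x + period + 1))


# Leaves of the trie, in lexicographical order.
leaves = sorted({block(x) for x in M})

# For each leaf string, the positions y in M with w[y - tau^2 .. y] equal to it,
# kept in the lexicographical order of the suffixes w[y..n].
groups = defaultdict(list)
for y in M:
    if y - period >= 1:
        groups[block(y - period)].append(y)
for key in groups:
    groups[key].sort(key=lambda y: w[y - 1:])
```

Here leaves has nine entries, abcab is the fourth of them, and groups["abcab"] is [8, 13, 5], in agreement with the contents of B_4 described above.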



Figure 2: The balanced search trees B_1, B_2, ..., B_9 are
augmented with some positions from M.

The construction of lcp and sa. To insert w[k..n]
in lcp and sa, we first insert w[k..k+τ^2] in S in
O(τ^2 + log n) time. If S did not contain the string
w[k..k+τ^2] before, then, using auxiliary structures
on S, we easily find in O(1) time the position in
lcp where the suffix w[k..n] should be inserted; in
the same way we obtain the LCE value between
w[k..n] and its immediate predecessor and successor
in S. Then, we modify the balanced search tree
representing lcp, insert a new node corresponding
to w[k..n] in L, insert this node in the balanced
search tree supporting sa, and, finally, add a new
empty tree B_a to the newly created leaf a of S. All
these modifications take O(log n) amortized time.

Now suppose S contains w[k..k+τ^2]. Denote by
a the leaf of S corresponding to w[k..k+τ^2]. In
O(log n) time we obtain the immediate predecessor
and successor of the node nds[k+τ^2] (recall that
k+τ^2 ∈ M) in the search tree B_a; denote these
nodes by nds[x] and nds[y], respectively. (We
assume that the predecessor and successor are both
defined; the case when one of them is undefined
is analogous.) Note that nds[x] is the immediate
predecessor only in the set of all nodes contained
in B_a but it may not be the immediate predecessor
in the whole list L; the situation with nds[y] is
similar. Then we insert nds[k+τ^2] between nds[x]
and nds[y] in B_a. Since w[x−τ^2..x] = w[y−τ^2..y] =
w[k..k+τ^2], it is straightforward that the suffixes
w[x−τ^2..n] and w[y−τ^2..n] are, respectively, the
immediate predecessor and successor of the suffix
w[k..n] in the set of all suffixes {w[z..n] : z ∈
M ∩ (k..n]}. Hence, we insert a new node nds[k]
in L between the nodes nds[x−τ^2] and nds[y−τ^2]
(these nodes are certainly adjacent).

It is easy to see that LCE(k, x−τ^2) =
τ^2 + LCE(k+τ^2, x) and LCE(k, y−τ^2) =
τ^2 + LCE(k+τ^2, y). The values LCE(k+τ^2, x) =
min{lcp[sa[x]], ..., lcp[sa[k+τ^2]−1]} and LCE(k+τ^2, y) =
min{lcp[sa[k+τ^2]], ..., lcp[sa[y]−1]}
can be computed in O(log n)
time using the balanced search trees supporting
access on sa and RMQ queries on lcp. All subsequent
changes of other structures are the same as in the
previous case and require O(log n) amortized time.

Finally, once the last suffix is inserted, we
construct in an obvious way the plain arrays lcp and sa
in O(n) time.

Time and space. The insertion of a new suffix in
the arrays lcp and sa takes O(τ^2 + log n) amortized
time. Thus, the construction of lcp and sa
consumes overall O((n/τ)(τ^2 + log n)) time as required.
The whole data structure occupies O(n) space.
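
For reference, the construction cost stated here and the query cost from the beginning of Section 3 combine as follows; the derivation only restates bounds already given in the text, and T(n) is a name introduced for this summary.

```latex
\begin{align*}
T(n) &= \underbrace{O\!\left(\tfrac{n}{\tau}\,(\tau^{2} + \log n)\right)}_{\text{building lcp and sa}}
      + \underbrace{O\!\left(n\,\tau^{2}\right)}_{O(n)\text{ queries, } O(\tau^{2})\text{ each}}
      = O\!\left(\tfrac{n}{\tau}\log n + n\,\tau^{2}\right), \\
\tfrac{n}{\tau}\log n = n\,\tau^{2}
     &\iff \tau^{3} = \log n \iff \tau = \log^{1/3} n, \\
\tau = \lceil \log^{1/3} n \rceil
     &\implies T(n) = O\!\left(n \log^{2/3} n\right).
\end{align*}
```

The two terms are balanced at this value of τ, and the term nτ from the construction is absorbed by nτ^2.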

4. Conclusion 

It seems that further improvements in the
considered problem may be achieved by more efficient
longest common extension data structures on a
general ordered alphabet. One might even conjecture
that there is a data structure that can execute any
sequence of k LCE queries on a string of length n
over a general ordered alphabet in O(k + n) time.
However, we do not yet have theoretical evidence
for such strong results.

Another interesting direction is a generalization
of our result for the case of online algorithms (e.g.,
see [5] and [?]).